Abstract This article concerns the following question arising in computational evolutionary biology.
For a given subclass of phylogenetic networks, what is the maximum value of 0 ≤ p ≤ 1 such that for
every input set T of rooted triplets, there exists some network 𝒩(T) from the subclass such that at least
p|T| of the triplets are consistent with 𝒩(T)? Here we prove that the set containing all triplets (the full
triplet set) in some sense defines p, and moreover that any network 𝒩 achieving fraction p' for the full
triplet set can be converted in polynomial time into an isomorphic network 𝒩'(T) achieving fraction
≥ p' for an arbitrary triplet set T. We demonstrate the power of this result for the field of phylogenetics
by giving worst-case optimal algorithms for level-1 phylogenetic networks (a much-studied extension
of phylogenetic trees), improving considerably upon the 5/12 fraction obtained recently by Jansson,
Nguyen and Sung in [13]. For level-2 phylogenetic networks we show that p ≥ 0.61. We note that all
the results in this article also apply to weighted triplet sets.

1 Introduction 

One of the most commonly encountered problems in computational evolutionary biology is to plausibly
infer the evolutionary history of a set of species, often abstractly modelled as a tree, using
obtained biological data. Existing algorithms for directly constructing such a tree do not scale well
(in terms of running time) and this has given rise to supertree methods: first infer trees for small
subsets of the species and then puzzle them together into a bigger tree such that in some well-defined
sense the information in the subset trees is preserved [5]. In the fundamental case where
the subsets in question each contain exactly three species (subsets of two or fewer species cannot
convey information) we speak of rooted triplet methods.

In recent years improved understanding of the complex mechanisms driving evolution has stimulated
interest in reconstructing evolutionary networks [9] [16] [18] [23]. Such structures are more
general than trees and allow us to capture the phenomenon of reticulate evolution, i.e. non tree-like
evolution caused by processes such as hybrid speciation and horizontal gene transfer. A natural
abstraction of reticulate evolution, used already in several papers, is to permit recombination vertices,
vertices with indegree greater than one. Informally a level-k phylogenetic network is an evolutionary
network in which each biconnected component contains at most k such recombination vertices.
Phylogenetic trees form the base: they are level-0 networks. The higher the level of a network, the more
intricate the pattern of reticulate evolution that it can accommodate. Note that phylogenetic networks
can also be useful for visualising two or more competing hypotheses about tree-like evolution.

Various authors have already studied the problem of constructing phylogenetic trees (and more
generally networks) which are consistent with an input set of rooted triplets. Aho et al. [1] showed
a simple polynomial-time algorithm which, given a set of rooted triplets, finds a phylogenetic tree
consistent with all the triplets, or shows that no such tree exists. For level-1 and level-2 networks
the equivalent problem is NP-hard [12][14], although it becomes polynomial-time solvable if the
input triplets are dense, i.e. if there is at least one triplet in the input for each subset of three
species [12][15].

Several authors have considered algorithmic strategies of use when the algorithms from [1] and
[15] fail to find a tree or network. Gąsieniec et al. [10] gave a polynomial-time algorithm which
always finds a tree consistent with at least 1/3 of the (weighted) input triplets, and furthermore
showed that 1/3 is best possible when all possible triplets on n species (the full triplet set) are given
as input. On the negative side, [6] [13] [23] showed that it is NP-hard to find a tree consistent with a
maximum number of input triplets. In the context of level-1 networks, [14] gave a polynomial-time
algorithm which produces a level-1 network consistent with at least 5/12 ≈ 0.4166 of the input
triplets. They also described an upper bound, which is a function of the number of distinct species
n in the input, on the fraction of input triplets that can be consistent with a level-1 network. As
in [10] this upper bound is tight in the sense that it is the best possible for the full triplet set on
n species. They computed a value of n for which their upper bound equals approximately 0.4883,
showing that in general a fraction better than this is not possible. The apparent convergence of
this bound from above to 0.4880... raises the question, however, whether a fraction better than 5/12
is possible for level-1 networks, and whether the full triplet set is in general always the worst-case
scenario for such fractions.

In this paper we answer these questions in the affirmative, and in fact we give a much stronger
result. In particular, we develop a probabilistic argument that (as far as such fractions are concerned)
the full triplet set is indeed always the worst possible case, irrespective of the type of network being
studied (Proposition 1, Corollary 1). Furthermore, by using a generic, derandomized polynomial-time
(re)labelling procedure we can convert a network 𝒩 which achieves a fraction p' for the full
triplet set into an isomorphic network 𝒩'(T) that achieves a fraction ≥ p' for a given input triplet
set T (Theorem 1). In this way we can easily use the full triplet set to generate, for any network
structure, a lower bound on the fraction that can be achieved for arbitrary triplet sets within such
a network structure. The derandomization we give is fully general (with a highly optimized running
time) and leads immediately to a simple extension of the 1/3 result from [10]. For level-1 networks
we use the derandomization to give a polynomial-time algorithm which achieves a fraction exactly
equal to the level-1 upper bound identified in [14], and which is thus worst-case optimal for level-1
networks. We formally prove that this achieves a fraction of at least 0.48 for all n. Finally, we
demonstrate the flexibility of our technique by proving that for level-2 networks (see [12]) we can,
for any triplet set T, find in polynomial time a level-2 network consistent with at least a fraction
0.61 of the triplets in T (Theorem 3).

We emphasize that in this article we are optimizing (and thus defining worst-case optimality)
with respect to |T|, the number of triplets in the input, not Opt(T), the size of the optimal
solution for that specific T. The latter formulation we call the MAX variant of the problem. The
fact that Opt(T) is always bounded above by |T| implies that an algorithm that obtains a fraction
p' of the input T is trivially also a p'-approximation for the corresponding MAX problem. Better
approximation factors for the MAX problem might, however, be possible. We discuss this further
in Section 7.

The results in this article are given in terms of unweighted triplet sets. A natural extension,
especially in phylogenetics, is to attach a weight to each triplet t ∈ T, i.e. a positive rational
value denoting the relative importance of (or confidence in) t. In this weighted version of the
problem fractions are defined relative to the total weight of T (defined as the sum of the weights
of all triplets in T), not to |T|. It is easy to verify that all the results in this article also hold
for the weighted version of the problem.

2 Definitions 

A phylogenetic network (network for short) 𝒩 on species set X is defined as a pair (N, γ) where
N is the network topology (topology for short) and γ is a labelling of the topology. The topology is
a directed acyclic graph in which exactly one vertex has indegree 0 and outdegree 2 (the root) and
all other vertices have either indegree 1 and outdegree 2 (split vertices), indegree 2 and outdegree 1
(recombination vertices) or indegree 1 and outdegree 0 (leaves). A labelling is a bijective mapping
from the leaf set of N (denoted L_N) to X. Let n = |X| = |L_N|.

A directed acyclic graph is connected (also called "weakly connected") if there is an undirected 
path between any two vertices and biconnected if it contains no vertex whose removal disconnects 
the graph. A biconnected component of a network is a maximal biconnected subgraph. 

Definition 1. A network is said to be a level-k network if each biconnected component contains
at most k ∈ ℕ recombination vertices.

We define phylogenetic trees to be the class of level-0 networks. 







Figure 1. One of the three possible triplets on the set of species {x,y,z}. Note that, as with all 
figures in this article, all arcs are assumed to be directed downwards, away from the root. 

The unique rooted triplet (triplet for short) on a species set {x, y, z} ⊆ X in which the lowest
common ancestor of x and y is a proper descendant of the lowest common ancestor of x and z is
denoted by xy|z (which is identical to yx|z). For any set T of triplets define X(T) as the union of
the species sets of all triplets in T.

Definition 2. A triplet xy|z is consistent with a network 𝒩 (interchangeably: 𝒩 is consistent with
xy|z) if 𝒩 contains a subdivision of xy|z, i.e. if 𝒩 contains vertices u ≠ v and pairwise internally
vertex-disjoint paths u → x, u → y, v → u and v → z. (Where it is clear from the context, as in this
case, we may refer to a leaf by the species that it is mapped to.)

By extension, a set of triplets T is consistent with a network 𝒩 (interchangeably: 𝒩 is consistent
with T) iff, for all t ∈ T, t is consistent with 𝒩. Checking consistency can be done in polynomial
time, see Lemmas 2 and 3.
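
The definition of consistency can be made concrete with a small brute-force check. The following is only an illustrative sketch, not the polynomial-time method of the lemmas referred to above: it enumerates all candidate vertices u, v and all directed paths, so it is exponential in general. The dictionary representation, the function names and the example network `gall` are assumptions introduced here for illustration.

```python
from itertools import product

def all_paths(children, s, t):
    """Return every directed path from s to t as a list of vertices.

    `children` maps a vertex to its out-neighbours. The graph is a DAG,
    so the recursion terminates. A path of length 0 (s == t) is allowed.
    """
    if s == t:
        return [[s]]
    return [[s] + rest
            for w in children.get(s, ())
            for rest in all_paths(children, w, t)]

def share_only_common_endpoints(p, q):
    """True if paths p and q meet only in vertices that are endpoints of both."""
    common_ends = {p[0], p[-1]} & {q[0], q[-1]}
    return set(p) & set(q) <= common_ends

def is_consistent(children, x, y, z):
    """Brute-force test of whether the triplet xy|z is consistent with the network."""
    vertices = set(children) | {w for ws in children.values() for w in ws}
    for u, v in product(vertices, repeat=2):
        if u == v:
            continue
        candidates = product(all_paths(children, u, x), all_paths(children, u, y),
                             all_paths(children, v, u), all_paths(children, v, z))
        for paths in candidates:
            if all(share_only_common_endpoints(paths[i], paths[j])
                   for i in range(4) for j in range(i + 1, 4)):
                return True
    return False

# Hypothetical level-1 network on three leaves; "h" is a recombination vertex.
gall = {"r": ["s1", "s2"], "s1": ["a", "h"], "s2": ["h", "c"], "h": ["b"]}

print(is_consistent(gall, "a", "b", "c"))  # ab|c
print(is_consistent(gall, "b", "c", "a"))  # bc|a
print(is_consistent(gall, "a", "c", "b"))  # ac|b
```

In this example the network is consistent with ab|c and bc|a but not with ac|b, because the only vertex with disjoint paths to both a and c is the root, which has no proper ancestor to play the role of v.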







Figure 2. An example of a level-1 network. 



3 Labelling a network topology 

Suppose we are given a topology N with n leaves, and a set T of triplets where L_N = {l_1, l_2, ..., l_n}
and X = X(T) = {x_1, x_2, ..., x_n}. The specific goal of this section is to create a labelling γ such
that the number of triplets from T consistent with (N, γ) is maximized.
Let f(N, γ, T) denote the fraction of T that is consistent with (N, γ).

Consider the special set T_1(n), the full triplet set, of all the possible $3\binom{n}{3}$ triplets with leaves
labelled from {x_1, x_2, ..., x_n}. Observe that for this triplet set the number of triplets consistent
with a phylogenetic network (N, γ) does not depend on the labelling γ. We may thus define #N =
f(N, γ, T_1(n)) = f(N, T_1(n)) by considering any arbitrary, fixed labelling γ. (Note that #N = 1/3
if N is a tree topology.)

We will argue that the triplet set T_1(n) is the worst-case input for maximizing f(N, γ, T) for
any fixed topology N on n leaves. In particular we prove the following:

Proposition 1. For any topology N with n leaves and any set of triplets T, if the labelling γ is
chosen uniformly at random, then the quantity f(N, γ, T) is a random variable with expected value
E(f(N, γ, T)) = #N.

Proof. Consider first the full triplet set T_1(n) = {t_1, t_2, ..., t_m}, where $m = 3\binom{n}{3}$, and an
arbitrary fixed labelling γ_0. By labelling N we fix the position of each of the triplets in N. Formally,
a position of a triplet t = xy|z (with respect to γ_0) is a triplet p = γ_0^{-1}(t) = γ_0^{-1}(x) γ_0^{-1}(y) | γ_0^{-1}(z)
on the leaves of N. We may list possible positions for a triplet in N as those corresponding to
t_1, t_2, ..., t_m in (N, γ_0). Since a #N fraction of t_1, t_2, ..., t_m is consistent with (N, γ_0), a #N
fraction of these positions makes the triplet consistent. Now consider a single triplet t ∈ T and a
labelling γ that is chosen randomly from the set Γ of n! possible bijections from L_N to X. Observe
that for each t_i ∈ T_1(n), exactly 2·(n − 3)! labellings γ ∈ Γ make triplet t have the same position in
(N, γ) as t_i has in (N, γ_0) (the factor of 2 comes from the fact that we think of xy|z and yx|z as
being the same triplet). Any single labelling occurs with probability 1/n!, hence triplet t takes any
single position with probability

$$\frac{2\,(n-3)!}{n!} = \frac{1}{3\binom{n}{3}}.$$

Since for an arbitrary t ∈ T each of the $3\binom{n}{3}$ positions has the same probability and a #N fraction
of them make t consistent, the probability of t being consistent with (N, γ) is #N. The expectation
is thus that a fraction #N of the triplets in T are consistent with (N, γ). □
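
The statement of Proposition 1 can be checked numerically on a small tree, where the expected fraction should be exactly 1/3 whatever the triplet set. The sketch below averages f over all n! labellings of one fixed 5-leaf tree topology. The topology `PARENT`, the helper names and the test triplets are assumptions chosen purely for illustration, and the consistency test used is the tree-only shortcut (compare depths of lowest common ancestors), not a general network test.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical 5-leaf binary tree topology, given as child -> parent.
PARENT = {"u": "r", "v": "r", "l1": "u", "l2": "u",
          "w": "v", "l3": "v", "l4": "w", "l5": "w"}
LEAVES = ["l1", "l2", "l3", "l4", "l5"]

def ancestors(v):
    """Vertices on the path from v up to the root, starting with v itself."""
    path = [v]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca_depth(a, b):
    """Depth (distance from the root) of the lowest common ancestor of a and b."""
    on_b = set(ancestors(b))
    lca = next(v for v in ancestors(a) if v in on_b)
    return len(ancestors(lca)) - 1

def fraction_consistent(labelling, triplets):
    """f(N, gamma, T) for a tree: xy|z holds iff lca(x, y) lies strictly below lca(x, z)."""
    leaf_of = {species: leaf for leaf, species in labelling.items()}
    hits = sum(lca_depth(leaf_of[x], leaf_of[y]) > lca_depth(leaf_of[x], leaf_of[z])
               for x, y, z in triplets)
    return Fraction(hits, len(triplets))

def expected_fraction(triplets, species):
    """Average of f over all n! labellings, i.e. a uniformly random labelling."""
    values = [fraction_consistent(dict(zip(LEAVES, perm)), triplets)
              for perm in permutations(species)]
    return sum(values, Fraction(0)) / len(values)

T = [("a", "b", "c"), ("a", "c", "d"), ("b", "d", "e"), ("c", "e", "a")]
print(expected_fraction(T, "abcde"))
```

Because every three leaves of a binary tree display exactly one of the three possible triplets, the average comes out as exactly 1/3 for any choice of T, matching #N = 1/3 for tree topologies.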
From the expected value of a random variable we may conclude the existence of a realization
that attains at least this value.

Corollary 1. For any topology N and any set of triplets T there exists a labelling γ_0 such that
f(N, γ_0, T) ≥ #N.

We may deterministically find such a γ_0 by derandomizing the argument in a greedy fashion,
using the method of conditional expectation (see e.g. [19]). In particular, we will show the following.

Theorem 1. For any topology N and any triplet set T, a labelling γ_0 such that f(N, γ_0, T) ≥ #N
can be found in time O(m^3 + n|T|), where m and n are the numbers of vertices and leaves of N.

We use standard arguments from conditional expectation to prove that the solution is of the
desired quality. It is somewhat more sophisticated to optimize the running time of this
derandomization procedure. We therefore dedicate the following subsection to the proof of Theorem 1.

3.1 An optimized derandomization procedure 

We start with a sketch of the derandomization procedure. 

1. Γ ← set of all possible labellings.

2. while there is a leaf l whose label is not yet fixed, do:

   - for every species x that is not yet used

     • let Γ_x be the set of labellings from Γ where l is labelled by x

     • compute E(f(N, γ_x, T)), where γ_x is chosen randomly from Γ_x

   - Γ ← Γ_x s.t. x = argmax_x E(f(N, γ_x, T))

3. return γ_0 ← the only element of Γ.

The following lemma shows that the solution produced is of the desired quality.
Lemma 1. f(N, γ_0, T) ≥ #N.

Proof. By Proposition 1 the initial random labelling γ has the property E(f(N, γ, T)) = #N.
It remains to show that this expectation is not decreasing when labels of leaves get fixed during
the algorithm. Consider a single update Γ ← Γ_x of the range of the random labelling. By the
choice of the leaf l to get a fixed label we choose a partition of Γ into blocks Γ_x. The expectation
E(f(N, γ, T)) is an average of f(N, γ, T) over Γ, and at least one of the blocks Γ_x of the partition
has this average at least as big as the total average. Hence, by the choice of Γ_x with the highest
expectation of f(N, γ_x, T), we get E(f(N, γ_x, T)) ≥ E(f(N, γ, T)). □
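
The greedy procedure and its guarantee can be illustrated with a deliberately naive sketch. Here the conditional expectations are computed by brute force over all completions of the partial labelling, which is exponential and only usable for very small n; the efficient implementation is the subject of the rest of this subsection. The function names, the oracle `is_cons(x, y, z)` on leaves, and the caterpillar example (leaves numbered 1..n from the cherry upwards, so that xy|z is consistent exactly when z > max(x, y)) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations

def conditional_expectation(leaves, species, partial, triplets, is_cons):
    """E[f] over all completions of the partial labelling (brute force, small n only).

    `partial` maps already-labelled leaves to species; `is_cons(x, y, z)` says
    whether the leaf triple xy|z is consistent with the topology.
    """
    free_leaves = [l for l in leaves if l not in partial]
    free_species = [s for s in species if s not in partial.values()]
    total, count = Fraction(0), 0
    for perm in permutations(free_species):
        leaf_of = {s: l for l, s in partial.items()}
        leaf_of.update(zip(perm, free_leaves))
        hits = sum(is_cons(leaf_of[x], leaf_of[y], leaf_of[z]) for x, y, z in triplets)
        total += Fraction(hits, len(triplets))
        count += 1
    return total / count

def derandomized_labelling(leaves, species, triplets, is_cons):
    """Fix leaves one at a time, always keeping the block with the best expectation."""
    partial = {}
    for leaf in leaves:
        unused = [s for s in species if s not in partial.values()]
        best = max(unused, key=lambda s: conditional_expectation(
            leaves, species, {**partial, leaf: s}, triplets, is_cons))
        partial[leaf] = best
    return partial

# Caterpillar tree on leaves 1..4 (assumed numbering: leaves 1 and 2 form the cherry).
caterpillar = lambda x, y, z: z > max(x, y)
leaves, species = [1, 2, 3, 4], ["a", "b", "c", "d"]
T = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a"), ("c", "d", "a"), ("a", "d", "b")]
labelling = derandomized_labelling(leaves, species, T, caterpillar)
print(labelling, conditional_expectation(leaves, species, labelling, T, caterpillar))
```

Since the expectation never decreases from one step to the next, the fraction achieved by the returned labelling is at least the initial expectation, which is 1/3 for this tree.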

We will now propose an efficient implementation of the derandomization procedure. Since the
procedure takes a topology as an input, we need to express the running time of this procedure
in terms of the size of N. We will use m = |V(N)| for the number of vertices of N, and n for the
number of leaves.

We will need a generalized definition of consistency. In Definition 2 it is assumed that x, y and
z are leaves of the network 𝒩. In the following we will use a predicate consistent(x, y, z) defined
on vertices of 𝒩 which coincides with Definition 2 when x, y and z are leaves. The predicate
consistent(x, y, z) is defined to be true if 𝒩 contains vertices u ≠ v and pairwise internally
vertex-disjoint paths u → x, u → y, v → u and v → z, where paths are allowed to be of length 0
(e.g. u = x), but vertices x, y and z need to be all different. We begin with a consistency checking
algorithm.

Lemma 2. Given a network 𝒩 with m vertices, we can preprocess it in time O(m^3) so that given
any three vertices x, y, z we can check whether xy|z is consistent with 𝒩 in time O(1).

Proof. Assume that vertices of 𝒩 are numbered according to a topological order so that whenever
(u, v) is an edge, u < v. Observe that consistent(x, y, z) can be expressed in terms of
consistent(x', y', z') where max(x', y', z') < max(x, y, z) as follows:

1. x, y < z. Then consistent(x, y, z) holds iff for some z' ≠ x, y, (z', z) is an edge and
consistent(x, y, z') holds.

2. x, z < y. Then consistent(x, y, z) holds iff (x, y) is an edge and join(x, z) holds, or for some
y' ≠ x, z, (y', y) is an edge and consistent(x, y', z) holds.

3. y, z < x. Then consistent(x, y, z) holds iff (y, x) is an edge and join(y, z) holds, or for some
x' ≠ y, z, (x', x) is an edge and consistent(x', y, z) holds.

where join(x, z) is a predicate stating that 𝒩 contains t ≠ x and internally vertex-disjoint
paths t → x and t → z. It can be verified in linear time, for example by checking if there is a path
r → z (where r is the root) after removing x from the network.

So, determining all consistent triplets can be done by calculating all consistent(x, y, z). The
number of operations we have to perform for each x, y, z is not greater than the sum of the indegrees
of those vertices, which makes the overall complexity O(m^3 + m^2|E(𝒩)|) = O(m^3). □
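
The join predicate used in the proof lends itself to a short illustration. The sketch below follows the suggestion in the proof literally: remove x and test by depth-first search whether z is still reachable from the root. The dictionary representation of the network and the example networks are assumptions introduced here, and treating join(x, z) as false when x is the root or x = z is a convention chosen for this sketch.

```python
def join(children, root, x, z):
    """True if z can be reached from the root by a directed path avoiding x.

    `children` maps each vertex to a list of its out-neighbours.
    Runs in time linear in the size of the network.
    """
    if x == root or x == z:
        return False
    stack, seen = [root], {root}
    while stack:
        v = stack.pop()
        if v == z:
            return True
        for w in children.get(v, ()):
            if w != x and w not in seen:
                seen.add(w)
                stack.append(w)
    return False

# Hypothetical examples: a three-leaf tree and a three-leaf level-1 network.
tree = {"r": ["p", "c"], "p": ["a", "b"]}
gall = {"r": ["s1", "s2"], "s1": ["a", "h"], "s2": ["h", "c"], "h": ["b"]}

print(join(tree, "r", "a", "b"))   # b is reachable without passing through a
print(join(tree, "r", "p", "a"))   # every path to a passes through p
print(join(gall, "r", "s1", "b"))  # b is still reachable through s2 and h
```

In the tree, removing the internal vertex p cuts off leaf a, so join(p, a) fails, whereas in the gall the recombination vertex h keeps b reachable after s1 is removed.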

Although we do not use the result in this article, we note that the above lemma can be strengthened
to obtain the following, which is proved in the appendix.

Lemma 3. Given a level-k network M , we can preprocess it in time 0{m + mk'^) so that given any 
three vertices x,y,z we can check whether xy\z is consistent with Af in time 0(1). 

Armed with Lemma 2 (or Lemma 3) we are now ready to proceed with the derandomization.
Assume that we assign labels to leaves in order of their position in the topological ordering of
Lemma 2; for simplicity let us refer to the leaves as 1, 2, ..., n. The most time-consuming part of
the algorithm is calculating the probability that a given triplet t = ab|c is consistent with a random
labelling having already specified the labels of leaves 1, 2, ..., k − 1. Let γ be this partial labelling.
We need to consider six cases:

1. Labels a, b, c have not been assigned yet. The probability is then simply the number of consistent
triplets xy|z with k ≤ x, y, z divided by $3\binom{n-k+1}{3}$.

2. γ(x) = a but both b and c have not been assigned yet. The probability is the number of k ≤ y, z
such that xy|z is a consistent triplet divided by $2\binom{n-k+1}{2}$.

3. γ(z) = c but both a and b have not been assigned yet. The probability is the number of k ≤ x < y
such that xy|z is a consistent triplet divided by $\binom{n-k+1}{2}$.

4. γ(x) = a, γ(y) = b but c has not been assigned yet. The probability is the number of k ≤ z such
that xy|z is a consistent triplet divided by n − k + 1.

5. γ(x) = a, γ(z) = c but b has not been assigned yet. The probability is the number of k ≤ y such
that xy|z is a consistent triplet divided by n − k + 1.

6. γ(x) = a, γ(y) = b, γ(z) = c. The probability is 0 or 1, depending on whether xy|z is consistent
or not.

In the first five cases, we can do rather better than simply counting all the xy|z each time from
scratch. Define:

count3(k)         = #(x, y, z) such that k ≤ x, y, z and x < y and consistent(x, y, z)
count2x(k, x)     = #(y, z) such that k ≤ y, z and consistent(x, y, z)
count2z(k, z)     = #(x, y) such that k ≤ x < y and consistent(x, y, z)
count1xy(k, x, y) = #z such that k ≤ z and consistent(x, y, z)
count1xz(k, x, z) = #y such that k ≤ y and consistent(x, y, z)



It is easy to see that we can compute all the above values in time O(m^3) as follows:

count3(k)         = count3(k + 1) + count2x(k + 1, k) + count2z(k + 1, k)
count2x(k, x)     = count2x(k + 1, x) + count1xz(k + 1, x, k) + count1xy(k + 1, x, k)
count2z(k, z)     = count2z(k + 1, z) + count1xz(k + 1, k, z)
count1xy(k, x, y) = count1xy(k + 1, x, y) + consistent(x, y, k)
count1xz(k, x, z) = count1xz(k + 1, x, z) + consistent(x, k, z)
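
The five recurrences can be tabulated directly, working from k = n down to k = 1. The sketch below is one possible way to do so, under the assumptions that the counters range over indices greater than or equal to k, that all counters are 0 for k = n + 1, and that the predicate is false unless its three arguments are distinct. The function name, the dictionary-based tables and the caterpillar predicate used as an example are illustrative choices, not part of the original text.

```python
from collections import defaultdict

def suffix_counts(n, cons):
    """Tabulate count3, count2x, count2z, count1xy, count1xz for k = n, ..., 1.

    cons(x, y, z) plays the role of consistent(x, y, z) on leaves 1..n and must
    return False unless x, y, z are pairwise distinct. Missing entries (k = n + 1)
    default to 0.
    """
    c1xy, c1xz = defaultdict(int), defaultdict(int)
    c2x, c2z, c3 = defaultdict(int), defaultdict(int), defaultdict(int)
    for k in range(n, 0, -1):
        for p in range(1, n + 1):
            for q in range(1, n + 1):
                c1xy[k, p, q] = c1xy[k + 1, p, q] + cons(p, q, k)  # new z = k
                c1xz[k, p, q] = c1xz[k + 1, p, q] + cons(p, k, q)  # new y = k
        for v in range(1, n + 1):
            c2x[k, v] = c2x[k + 1, v] + c1xz[k + 1, v, k] + c1xy[k + 1, v, k]
            c2z[k, v] = c2z[k + 1, v] + c1xz[k + 1, k, v]
        c3[k] = c3[k + 1] + c2x[k + 1, k] + c2z[k + 1, k]
    return c3, c2x, c2z, c1xy, c1xz

# Caterpillar tree on 5 leaves (assumed numbering: xy|z is consistent iff z > max(x, y)).
caterpillar = lambda x, y, z: len({x, y, z}) == 3 and z > max(x, y)
c3, c2x, c2z, c1xy, c1xz = suffix_counts(5, caterpillar)
print(c3[1], c2x[1, 1], c2z[1, 5], c1xy[1, 1, 2])
```

For the caterpillar, count3(1) counts the triples x < y < z, so it equals the binomial coefficient of 5 over 3, which is one third of all triplet positions, as expected for a tree.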

Having preprocessed the above values, we can calculate each probability in time O(1), giving us
a total running time of O(m^3 + n^2|T|). It turns out, however, that we can do slightly better. As it
currently stands, for each leaf and each unused label we calculate the expected number of consistent
triplets after assigning this label to this leaf separately. To further improve the running time we
can try to do all such calculations at once (for a fixed leaf): for each triplet and
for each unused label we calculate the probability that this triplet is consistent after we use this
label. (In other words, we switch from a leaf-label-triplet loop nesting to leaf-triplet-label.) Then it
turns out that those probabilities are the same for almost all unused labels. More formally:

Lemma 4. Let Γ be the set of labellings that assign γ(i) to leaf i for each i = 1, 2, ..., k − 1 and
Γ_x be the subset of Γ that assign x to leaf k. We can find x for which E(f(N, γ_x, T)) is maximum
in time O(|T|) assuming the above preprocessing.

Proof. Let e_x = E(f(N, γ_x, T)). Each such e_x is a sum of probabilities corresponding to the elements
of T. We will start with all e_x equal to 0 and consider triplets one by one. For each triplet t we will
increase the appropriate values of e_x by the corresponding probabilities. Again we should consider
six cases depending on what t = ab|c looks like. Once leaf k has been labelled, the leaves k + 1, ..., n
remain unlabelled; writing r = n − k for their number, let

$$A = \frac{1}{3\binom{r}{3}}, \qquad B = \frac{1}{\binom{r}{2}}, \qquad C = \frac{1}{r}.$$

1. Labels a, b, c have not been assigned yet. We should increase e_a and e_b by count2x(k + 1, k)·B/2,
e_c by count2z(k + 1, k)·B and the remaining e_l by count3(k + 1)·A.

2. γ(x) = a but both b and c have not been assigned yet. We should increase e_b by
count1xy(k + 1, x, k)·C, e_c by count1xz(k + 1, x, k)·C and the remaining e_l by count2x(k + 1, x)·B/2.

3. γ(z) = c but both a and b have not been assigned yet. We should increase e_a and e_b by
count1xz(k + 1, k, z)·C and the remaining e_l by count2z(k + 1, z)·B.

4. γ(x) = a, γ(y) = b but c has not been assigned yet. We should increase e_c by consistent(x, y, k)
and the remaining e_l by count1xy(k + 1, x, y)·C.

5. γ(x) = a, γ(z) = c but b has not been assigned yet. We should increase e_b by consistent(x, k, z)
and the remaining e_l by count1xz(k + 1, x, z)·C.

6. γ(x) = a, γ(y) = b, γ(z) = c. We should increase all e_l by consistent(x, y, z).

The naïve implementation of the above procedure would require time O(|T|n). We can improve
it by observing that we are interested only in the relative increment of the different e_x, not in their
actual values. So, instead of increasing all e_l with l ∉ S by some δ (for some S ⊆ X), we can
decrease all e_l with l ∈ S by this δ. Then processing each triplet takes time O(1) as we only have
to change at most 3 values of e_x.

□ 
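
The relative-increment trick at the end of the proof can be sketched as follows. Each triplet contributes a common gain to almost all unused labels and a different gain to at most three special labels; subtracting the common gain from the special ones leaves the argmax unchanged and touches only a constant number of entries per triplet. The function names and the `(default, special)` update format are assumptions introduced for illustration; the actual gains would come from the six cases of the proof.

```python
def naive_totals(unused, updates):
    """Reference version: apply every update to every unused label, O(|T| n)."""
    totals = {label: 0 for label in unused}
    for default, special in updates:
        for label in unused:
            totals[label] += special.get(label, default)
    return totals

def best_label(unused, updates):
    """Return a label maximising e_l while touching O(1) entries per triplet.

    `updates` has one entry per triplet: (default, special), meaning that every
    unused label gains `default`, except the labels in the dict `special`
    (at most three per triplet), which gain special[label] instead.
    """
    offset = {label: 0 for label in unused}
    for default, special in updates:
        for label, gain in special.items():
            if label in offset:
                offset[label] += gain - default  # gain relative to the common default
    return max(unused, key=lambda label: offset[label])

unused = ["a", "b", "c", "d"]
updates = [(2, {"a": 5, "b": 1}), (1, {"c": 4}), (3, {"a": 0, "d": 3})]
print(best_label(unused, updates), naive_totals(unused, updates))
```

The offsets differ from the true totals only by the sum of the defaults, which is the same for every label, so both versions select the same label.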

We may now compose these lemmas into the following. 

Proof (of Theorem 1). Consider the procedure sketched at the beginning of the subsection and
implemented according to the above lemmas. By Lemma 1 it produces good labellings. By Lemma 2
and n applications of Lemma 4 it can be implemented to run in O(m^3 + n|T|) time. □

3.2 Consequences of Theorem 1

The above theorem gives a new perspective on the problem of approximately constructing phylogenetic
networks. From the algorithm of Gąsieniec et al. [10] we can always construct a phylogenetic
tree that is consistent with at least 1/3 of the input triplets. In fact, the trees constructed by
this algorithm are very specific: they are always caterpillars. (A caterpillar is a phylogenetic tree
such that, after removal of leaves, only a directed path remains.) Theorem 1 implies that not only
caterpillars, but all possible tree topologies have the property that given any set of triplets we
may find in polynomial time a proper assignment of species to leaves with the guarantee that the
resulting phylogenetic tree is consistent with at least a third of the input triplets.

The generality of Theorem 1 makes it meaningful not only for trees, but also for any other subclass
of phylogenetic networks (e.g. for level-k networks). Let us assume that we have focussed our
attention on a certain subclass of networks. Consider the task of designing an algorithm that for a
given triplet set constructs a network from the subclass consistent with at least a certain fraction of
the given triplets. A worst-case approach as described in this section will never give us a guarantee
better than the maximum value of #N ranging over all topologies N in the subclass. Therefore,
if we intend to obtain networks consistent with a big fraction of triplets and if our criterion is to
maximize this fraction in the worst case, then our task reduces to finding topologies within the
subclass that are good for the full triplet set. Theorem 1 potentially has a further use as a mechanism
for comparing the quality of phylogenetic networks generated by other methods, because it
provides lower bounds for the fraction of T that a given topology and/or subclass of topologies
can be consistent with. (Although a fundamental problem in phylogenetics [2] [7] [8] [20] [17], the
science of network comparison is still very much in its infancy.)

For level-0 networks (i.e. phylogenetic trees) the problem of finding optimal topologies for the
full triplet set is simple: any tree is consistent with exactly 1/3 of the full triplet set. For level-1
phylogenetic networks a topology that is optimal for the full triplet set was constructed in [14].
We may use this network and Theorem 1 to obtain an algorithm that works for any triplet set and
creates a network that is consistent with the biggest possible fraction of triplets in the worst case
(see Section 4 for more details). For level-2 networks we do not yet know the optimal structure of
a topology for the full triplet set, but we will show in Section 5 that we can construct a network
that has a guarantee of being consistent with at least a fraction 0.61 of the input triplets.






4 Application to level-1 phylogenetic networks 

4.1 A worst-case optimal polynomial-time algorithm for level-1 networks 

In [14] it was shown how to construct a special level-1 topology C(n), which we call a galled
caterpillar, such that #C(n) ≥ #N for all level-1 topologies N on n leaves. The existence of
C(n), which has a highly regular structure, was proven by showing that any other topology N can
be transformed into C(n) by local rearrangements that never decrease the number of triplets the
associated network is consistent with. It was shown that $\#C(n) = S(n)/3\binom{n}{3}$, where S(0) = S(1) =
S(2) = 0 and, for n > 2,

$$S(n) = \max_{1 \leq a \leq n} \left\{ \binom{a}{3} + 2\binom{a}{2}(n-a) + a\binom{n-a}{2} + S(n-a) \right\} \qquad (1)$$

Figure 3. This is galled caterpillar C(17). It contains two galls and ends with a tail of two leaves.
C(17) contains 11 leaves in the top gall because Equation 1 is maximised for a = 11.

In Figure 3 an example of a galled caterpillar is shown. All galled caterpillars on n ≥ 3 leaves
consist of one or more galls chained together in linear fashion and terminating in a tail of one or
two leaves. Observe that the recursive structure of C(n) mirrors directly the recursive definition of
S(n) in the sense that the value of a chosen at recursion level k is equal to the number of leaves
found in the kth gall, counting downwards from the root. In the definition of C(n) it is not specified
how the a leaves at a given recursion level are distributed within the gall, but it is easy to verify
that placing them all on one side of the gall (as shown in the figure) is sufficient.
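
The recursive definition of S(n) translates into a short dynamic program that also records the gall sizes. The sketch below assumes the recurrence S(n) = max over 1 ≤ a ≤ n of C(a,3) + 2·C(a,2)·(n − a) + a·C(n − a,2) + S(n − a), with S(0) = S(1) = S(2) = 0, where C(·,·) denotes a binomial coefficient; these are the terms that appear in the proof of Lemma 6. The function names are illustrative, and ties between values of a are broken towards the smallest a, which is an arbitrary choice.

```python
from math import comb

def galled_caterpillar_table(n_max):
    """Return (S, best_a): S[n] and a maximising choice of a for each n <= n_max."""
    S = [0] * (n_max + 1)
    best_a = [0] * (n_max + 1)
    for n in range(3, n_max + 1):
        for a in range(1, n + 1):
            value = (comb(a, 3) + 2 * comb(a, 2) * (n - a)
                     + a * comb(n - a, 2) + S[n - a])
            if value > S[n]:           # strict: keeps the smallest maximising a
                S[n], best_a[n] = value, a
    return S, best_a

def gall_sizes(n, best_a):
    """Number of leaves in each gall of C(n), from the root downwards."""
    sizes = []
    while n > 2:                       # at most two leaves remain as the tail
        sizes.append(best_a[n])
        n -= best_a[n]
    return sizes

def ratio(n, S):
    """Fraction of the full triplet set consistent with C(n), for n >= 3."""
    return S[n] / (3 * comb(n, 3))

S, best_a = galled_caterpillar_table(17)
print(gall_sizes(17, best_a), S[17], round(ratio(10, S), 3))
```

For n = 17 this yields galls of 11 and 4 leaves followed by a tail of two leaves, in agreement with the caption of Figure 3, and the table is filled in O(n^2) time.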

Lemma 5. Let T be a set of input triplets labelled by n species. Then in time O(n^3 + n|T|) it is
possible to construct a level-1 network 𝒩, isomorphic to the galled caterpillar C(n), consistent with
at least a fraction $S(n)/3\binom{n}{3}$ of T.

Proof. First we construct the level-1 topology C(n). Using dynamic programming to compute all
values of S(n') for 0 ≤ n' ≤ n we can do this in time O(n^2). Note that C(n) contains in total
O(n) vertices. It remains only to choose an appropriate labelling of the leaves of C(n), and this is
achieved by substituting C(n) for N in Theorem 1; this dominates the running time. □



(In [14] the galled caterpillar is called a caterpillar network.)

Note that, because C(n) achieves the best possible fraction for the input T_1(n), the fraction
achieved by Lemma 5 is worst-case optimal for all n. Empirical experiments suggest that the
function $S(n)/3\binom{n}{3}$ is strictly decreasing and approaches a horizontal asymptote of 0.4880... from above;
for values of n = 10^1, 10^2, 10^3, 10^4 the respective ratios are 0.511..., 0.490..., 0.4882..., 0.4880.... It is
difficult to formally prove convergence to 0.4880... so we prove a slightly weaker lower bound of 0.48
on this function. From this it follows that in all cases the algorithm described in Lemma 5 is
guaranteed to produce a network consistent with at least a fraction 0.48 of T, improving considerably
on the 5/12 ≈ 0.4166 fraction achieved in [14].

Lemma 6. $S(n)/3\binom{n}{3} \geq 0.48$ for all n > 0.

Proof. This can easily be computationally verified for n < 116; we have done this with a computer
program written in Java [25]. To prove it for n ≥ 116, assume by induction that the claim is true
for all n' < n. Instead of choosing the value of a that maximises S(n) we claim that setting a equal
to z = ⌊2n/3⌋ is sufficient for our purposes. We thus need to prove the following inequality:

$$\frac{\binom{z}{3} + 2\binom{z}{2}(n-z) + z\binom{n-z}{2} + S(n-z)}{3\binom{n}{3}} \geq \frac{48}{100}$$

Combined with the fact that, by induction, $S(n-z)/3\binom{n-z}{3} \geq 48/100$, it is sufficient to prove that:

$$\frac{\binom{z}{3} + 2\binom{z}{2}(n-z) + z\binom{n-z}{2} + \frac{144}{100}\binom{n-z}{3}}{3\binom{n}{3}} \geq \frac{48}{100}$$

Using Mathematica we rearrange the previous inequality to:

$$\frac{\lfloor 2n/3 \rfloor \left( 22 + 9n + 33n^2 - 6(7 + 18n)\lfloor 2n/3 \rfloor + 86\lfloor 2n/3 \rfloor^2 \right)}{n(2 - 3n + n^2)} \leq 0$$

Taking (2n/3) − 1 as a lower bound on ⌊2n/3⌋, and 2n/3 as an upper bound, it can be easily verified
that the above inequality is satisfied for n ≥ 116. □
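
The final verification step can be spelled out as follows; this is a worked check under the stated bounds, not a reproduction of the original Mathematica session. Write z = ⌊2n/3⌋, so that 2n/3 − 1 ≤ z ≤ 2n/3. For n ≥ 3 both z and the denominator n(2 − 3n + n²) = n(n − 1)(n − 2) are positive, so it suffices that the bracketed factor 22 + 9n + 33n² − 6(7 + 18n)z + 86z² is non-positive. The z² term has a positive coefficient, so the upper bound on z is used there; the linear term has a negative coefficient, so the lower bound is used there.

```latex
\begin{align*}
22 + 9n + 33n^2 - 6(7+18n)\,z + 86\,z^2
  &\le 22 + 9n + 33n^2 - 6(7+18n)\left(\tfrac{2n}{3} - 1\right)
       + 86\left(\tfrac{2n}{3}\right)^2 \\
  &= 22 + 9n + 33n^2 - \left(72n^2 - 80n - 42\right) + \tfrac{344}{9}n^2 \\
  &= 64 + 89n - \tfrac{7}{9}n^2 .
\end{align*}
% 64 + 89n - (7/9)n^2 <= 0  <=>  7n^2 - 801n - 576 >= 0.
% At n = 115: 7*115^2 - 801*115 - 576 = -116 < 0.
% At n = 116: 7*116^2 - 801*116 - 576 =  700 >= 0.
```

The quadratic 7n² − 801n − 576 is increasing for n ≥ 58, so it stays non-negative for every n ≥ 116, which is where the threshold 116 in the proof comes from.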

To conclude this section we combine Lemmas 5 and 6 into the following theorem.

Theorem 2. Let T be a set of input triplets labelled by n species. In time O(n^3 + n|T|) it is possible
to construct a level-1 network 𝒩 consistent with at least a fraction $S(n)/3\binom{n}{3} \geq 0.48$ of T, and this
is worst-case optimal.



5 A lower bound for level-2 networks 

Theorem 3. Let T be a set of input triplets labelled by n species. It is possible to find in polynomial
time a level-2 network 𝒩(T) such that 𝒩(T) is consistent with at least a fraction 0.61 of T.

Proof. We prove this by induction. For n < 16813 we use a computational proof. For n ≥ 16813
we use Mathematica to show that, assuming the induction base, a fraction 0.61 can be achieved
by repeatedly chaining together a very basic type of level-2 network into some kind of "level-2
caterpillar"; the details are deferred to the appendix. □

6 The complexity of optimisation

Given a topology N and a set of triplets T, the techniques described in this article guarantee to
find a labelling γ such that f(N, γ, T) ≥ #N. It is natural to explore the complexity of finding,
in polynomial time, (approximations to) an optimal labelling of N for a particular triplet set T.
The observation below rules out (under standard complexity-theoretic assumptions) the existence
of a Polynomial-Time Approximation Scheme (PTAS) for this problem. Secondly we observe that
a PTAS for the problem MAX-LEVEL-0 (which carries the name MCTT in [21]) can also be ruled
out. We discuss the consequences of this in the next section.

Problem: MAX-LEVEL-0-LABELLING
Input: A level-0 topology N (i.e. a topology of a phylogenetic tree) and a set T of rooted triplets.
Output: The maximum value of s such that there exists a labelling $\gamma$ of N making $(N, \gamma)$
consistent with at least s triplets from T.

Problem: MAX-LEVEL-0
Input: A set T of rooted triplets.
Output: The maximum value of s such that there exists a level-0 network N (i.e. a phylogenetic
tree) consistent with at least s triplets from T.

Observation 1. MAX-LEVEL-0 and MAX-LEVEL-0-LABELLING are both APX-complete.

Proof. By approximation-preserving reduction from MAXIMUM SUBDAG / LINEAR ORDERING.
See appendix. □

7 Conclusions and open questions 

With Theorem [1] we have described a method which shows how, in polynomial time, good solutions 
for the full triplet set can be efficiently converted into equally good, or better, solutions for more 
general triplet sets. Where best-possible solutions are known for the full triplet set, this leads to 
worst-case optimal algorithms, as demonstrated by Theorem [2]. An obvious next step is to use
this method to generate algorithms (where possible worst-case optimal) for wider subclasses of 
phylogenetic networks. Finding the/an "optimal form" of level-2 networks for the full triplet set 
remains a fascinating open problem. 

From a biological perspective (and from the perspective of understanding the relevance of triplet 
methods) it is also important to attach meaning to the networks that the techniques described in 
this paper produce. For example, we have shown how, for level-1 networks, we can always find a
network isomorphic to a galled caterpillar which is consistent with at least a fraction 0.48 of the 
input. If we do this, does the location of the species within this galled caterpillar communicate any 
biological information? Also, what does it say about the relevance of triplet methods, and especially 
the level-k hierarchy, if we know a priori that a large fraction (already for level 2 at least 0.61)
of the input can be made consistent with some network from the subclass? And, as discussed in
Section [3.2], how far can the techniques described in this paper be used as a quality measure for
networks produced by other algorithms? 

As mentioned in the introduction, an algorithm guaranteed to find a network consistent with a
fraction p' of the input trivially becomes a p'-approximation for the MAX variant of the problem
(where we optimise not with respect to |T| but with respect to the size of the optimal solution
for T). In fact, the best-known approximation factor for MAX-LEVEL-0 is 1/3, a trivial extension
of the fact that p = 1/3 for trees [10]. On the other hand, the APX-hardness of this problem
implies that an approximation factor arbitrarily close to 1 will not be possible. It remains a highly
challenging open problem to determine whether better approximation factors can be obtained for
the latter problem via some different approach. (We note that an approximation factor larger than 1/2
would be a major breakthrough because, by the proof of Observation [1], this would improve upon
the long-standing best-known approximation factor of 1/2 for the problem LINEAR ORDERING
[21].) Alternatively, there could be some complexity-theoretic reason why approximation factors
better than p (where p is optimal in our formulation) are not possible. Under strong complexity-theoretic
assumptions the best approximation factor possible for MAX-3-SAT, for example, uses a
trivial upper bound of all the clauses in the input [11], analogous perhaps to using |T| as an upper
bound.
A Appendix

In this section we give the full proofs of Lemma [3] from Section [3], Theorem [3] from Section [5] and
Observation [1] from Section [6].



Figure 4. The case LCA(A, B) = LCA(A, B, C) from the proof of Lemma [3].

Lemma [3]. Given a level-k network N, we can preprocess it in time $O(m + mk^2)$ so that given any
three vertices x, y, z we can check whether xy|z is consistent with N in time O(1).

Proof. We begin with a definition and two claims. A chain is a sequence of vertices $v_0 \to v_1 \to
\ldots \to v_{k+1}$ such that the indegree and outdegree of each of $v_1, \ldots, v_k$ are equal to 1. Now, consider an arbitrary
biconnected component in N. We claim that, if each chain within the component is replaced by
an edge, the transformed component contains at most max(2, 3k) vertices. This is trivially true for
the biconnected component consisting of only a single edge. To see that it is more generally true,
note that the transformed component contains at most k recombination vertices and that all other
vertices have outdegree 2 and indegree at most 1. Each such non-recombination vertex creates at
least one new path that eventually has to terminate in a recombination vertex, and a recombination
vertex can terminate at most two paths. So there are at most 2k non-recombination vertices, and
the claim follows.

From now on biconnected component (or simply component) refers to a biconnected component
containing more than one edge. We claim that all biconnected components within N are vertex-disjoint.
To see why this is, note that each vertex in such a component must have degree at least
2 within that component. The indegree plus outdegree of every vertex in N is at most 3, so no vertex can be a part
of more than one such component.

The above reasoning shows that we could imagine the network as a rooted tree T in which each 
vertex corresponds to some bigger component that has relatively simple structure: contracting all 
chains in it gives us a graph of size O(k).

First we focus on the case when x, y, z lie in one biconnected component. The obvious solution
is to simply preprocess such queries for each triple of vertices from one biconnected component.
This does not give us the desired complexity yet: while each component is small after contracting
chains, it might originally contain as many as $\Omega(m)$ vertices. We may overcome this difficulty by
observing that if some chain consists of more than 3 inner vertices, we can replace it by a chain
containing exactly 3 of them. Then we preprocess the resulting graph using the $O(|V|^3)$ algorithm
from Lemma [2] (observe that $|V| = O(k)$). Given a query concerning consistency of some xy|z we
may have to replace some of x, y, z with other vertices lying on the shortened chains, which can be
easily done in time O(1). The whole preprocessing takes time $\sum_i O(k_i^3)$, where all $k_i = O(k)$ and
$\sum_i k_i = O(m)$. Since $\sum_i k_i^3 \leq (\max_i k_i)^2 \sum_i k_i = O(mk^2)$, this gives us the desired complexity.

Now we can solve the general case. Let A, B, C be biconnected components such that $x \in A$,
$y \in B$ and $z \in C$. We can preprocess T in linear time so that given two of its vertices we can find
their lowest common ancestor (LCA) in time O(1) [3]. (This is already sufficient for the case k = 0
and contributes the m term in the running time.) If xy|z is consistent there are only two possible
situations:

1. LCA(A, B, C) is a proper ancestor of LCA(A, B). This is easy to detect.

2. LCA(A, B) = LCA(A, B, C). Let D = LCA(A, B). We can find x', y', z', the entrance points
within D (see Figure [4]), in time O(1). This can be done either by using level ancestor queries
or by modifying the LCA algorithm [4]. We can then check whether x'y'|z' is consistent using the
above preprocessing.

□ 
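For the tree case (k = 0) mentioned in the proof, the consistency test reduces to comparing two lowest common ancestors. The following is a minimal sketch under stated assumptions: the tree is given as a hypothetical `parent` dictionary, x, y, z are distinct leaves, and the LCA is found by naively walking parent pointers, so queries are not O(1) as they are with the structure cited in the text.

```python
# Sketch for the tree case (k = 0) only. The names below are illustrative
# and the LCA routine is the naive one, not the O(1)-query structure of [3].

def depth(parent, v):
    """Number of edges from v up to the root (the vertex whose parent is None)."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d

def lca(parent, u, v):
    """Lowest common ancestor of u and v, found by lifting the deeper vertex."""
    du, dv = depth(parent, u), depth(parent, v)
    while du > dv:
        u, du = parent[u], du - 1
    while dv > du:
        v, dv = parent[v], dv - 1
    while u != v:
        u, v = parent[u], parent[v]
    return u

def is_consistent(parent, x, y, z):
    """True iff the triplet xy|z is consistent with the tree."""
    lxy = lca(parent, x, y)
    # LCA(x, y, z) must be a proper ancestor of LCA(x, y).
    return lca(parent, lxy, z) != lxy

# Toy tree ((a, b), c): root r has children u and c; u has children a and b.
example_parent = {"r": None, "u": "r", "c": "r", "a": "u", "b": "u"}
```

On the toy tree, `is_consistent(example_parent, "a", "b", "c")` holds while `is_consistent(example_parent, "a", "c", "b")` does not.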




Figure 5. We construct the network LB2(n) by repeatedly chaining the structure on the left, a
simple level-2 network (see [12]), together to obtain an overall topology resembling the structure
on the right.

Theorem [3]. Let T be a set of input triplets labelled by n species. It is possible to find in polynomial
time a level-2 network N(T) such that N(T) is consistent with at least a fraction 0.61 of T.

Proof. We prove the theorem by showing how to construct a topology, which we call LB2(n),
consistent with at least 0.61 of the triplets in $T_1(n)$. Using Theorem [1], LB2(n) can then be labelled
to obtain the network N(T). We show by induction how LB2(n) can be constructed. We take
$n < 16813$ as the induction base; for these values of n we refer to a simple computational proof
written in Java. We now prove the result for $n \geq 16813$. Let us assume by induction that, for
any n' < n, there exists some topology LB2(n') such that $\#LB2(n') \geq 0.61$. If we let t(n') equal
the number of triplets in $T_1(n')$ consistent with LB2(n'), we have that $t(n')/3\binom{n'}{3} \geq 0.61$ and thus
that $t(n') \geq 1.83\binom{n'}{3}$. Consider the structure in Figure [5]. For $S \in \{A,B,C,D,E\}$, we define the
operation hanging $l$ leaves from side S as replacing the edge S with a directed path containing $l$
internal vertices, and then attaching a leaf to each internal vertex. We construct LB2(n) as follows.
We create a copy of the structure from the figure and hang $c = \lfloor 0.385n \rfloor$ leaves from side C,
$d = \lfloor 0.07n \rfloor$ from side D and $e = \lfloor 0.26n \rfloor$ from side E. We let $f = \lfloor 0.285n \rfloor$ and add the edge
(F, r), where r is the root of the network LB2(f). Finally we hang $a = n - (c + d + e + f)$ leaves
from side A; it might be that a = 0. (The only reason we hang leaves from side A is to compensate
for the possibility that c + d + e + f does not exactly equal n.) This completes the construction
of LB2(n); note that, as in Section [4], the network is constructed by recursively chaining the same
basic structure together.
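To make the bookkeeping of the construction concrete, the sketch below computes the leaf counts for one level of the recursion and the sequence of sizes on which the recursion is invoked. The function names and the `base` parameter are illustrative assumptions, not taken from the authors' Java code; integer arithmetic is used so that the floors are exact.

```python
def lb2_leaf_counts(n):
    """Leaf counts (a, c, d, e, f) for one step of the construction of LB2(n)."""
    c = (385 * n) // 1000   # floor(0.385 n), hung from side C
    d = (70 * n) // 1000    # floor(0.07 n),  hung from side D
    e = (260 * n) // 1000   # floor(0.26 n),  hung from side E
    f = (285 * n) // 1000   # floor(0.285 n), size of the recursive part LB2(f)
    a = n - (c + d + e + f) # leftover leaves, hung from side A
    return a, c, d, e, f

def recursion_sizes(n, base=16813):
    """Sizes n, f, f', ... visited until the size drops below the induction base."""
    sizes = [n]
    while sizes[-1] >= base:
        sizes.append(lb2_leaf_counts(sizes[-1])[4])
    return sizes
```

Because the four coefficients sum to 1 and each floor loses less than one leaf, the leftover `a` is always between 0 and 3.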

We can use Mathematica to show that LB2(n) is consistent with at least 0.61 of the triplets
in $T_1(n)$. In particular, by explicitly counting the triplets consistent with LB2(n) we derive an
inequality expressed in terms of n, c, d, e, f, t(f), which Mathematica then simplifies to a cubic
inequality in n that holds for all $n \geq 16813$. (To simplify the inequality we take $x - 1$ as a
lower bound on $\lfloor x \rfloor$ and assume that no leaves are hung from side A.) The Mathematica script is
reproduced in Figure [6] and can be downloaded from [25]. Finally, we comment that the networks
computationally constructed for $n < 16813$ are, essentially, built in the same way as the networks
described above. The only difference is that, to absorb inaccuracies arising from the floor function,
we try several possibilities for how many leaves should be hung from each side; for side C, for
example, we also try $c - 1$ and $c + 1$ leaves. □

c[n] = ((385/1000)*n) - 1
d[n] = ((70/1000)*n) - 1
e[n] = ((260/1000)*n) - 1
f[n] = ((285/1000)*n) - 1
t[n] = (3*(610/1000)*Binomial[f[n], 3])

Simplify[
 (Binomial[c[n], 3] + Binomial[d[n], 3] + Binomial[e[n], 3] + t[n] + (Binomial[c[n], 2]*d[n]) +
  (Binomial[c[n], 2]*e[n]) + (Binomial[c[n], 2]*f[n]) + (c[n]*d[n]*e[n]) +
  (c[n]*d[n]*f[n]) + (Binomial[c[n], 2]*e[n]) + (c[n]*e[n]*d[n]) +
  (c[n]*e[n]*f[n]) + (Binomial[c[n], 2]*f[n]) + (c[n]*f[n]*d[n]) +
  (c[n]*f[n]*e[n]) + (Binomial[d[n], 2]*c[n]) + (Binomial[d[n], 2]*e[n]) +
  (Binomial[d[n], 2]*f[n]) + (d[n]*f[n]*c[n]) + (Binomial[d[n], 2]*f[n]) +
  (d[n]*f[n]*e[n]) + (Binomial[e[n], 2]*c[n]) + (Binomial[e[n], 2]*d[n]) +
  (Binomial[e[n], 2]*f[n]) + (e[n]*f[n]*c[n]) + (e[n]*f[n]*d[n]) +
  (Binomial[e[n], 2]*f[n]) + (Binomial[f[n], 2]*c[n]) + (Binomial[f[n], 2]*d[n]) +
  (Binomial[f[n], 2]*e[n])) / (3*Binomial[n, 3]) > (610/1000)]

(-147984000000 + 94041640000 n - 16470260400 n^2 + 979319 n^3) / (n (2 - 3 n + n^2)) > 0

Reduce[(-147984000000 + 94041640000 n - 16470260400 n^2 + 979319 n^3) / (n (2 - 3 n + n^2)) > 0, n, Integers]

n ∈ Integers && (n ≤ -1 || n ≥ 16813)

Figure 6. The Mathematica script used in the proof of Theorem [3].
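The final step of the script can be cross-checked independently with exact integer arithmetic. The sketch below takes the simplified cubic inequality as reported in the figure (it does not re-derive it from the triplet counts) and locates the smallest positive integer at which it holds; the function names are illustrative.

```python
def cubic_numerator(n):
    """Numerator of the simplified inequality reported in Figure 6."""
    return (-147984000000 + 94041640000 * n
            - 16470260400 * n**2 + 979319 * n**3)

def inequality_holds(n):
    """True iff numerator / (n (2 - 3n + n^2)) > 0 for the integer n."""
    denominator = n * (2 - 3 * n + n**2)
    # A quotient of integers is positive iff the product is positive;
    # this also returns False where the denominator vanishes (n = 0, 1, 2).
    return cubic_numerator(n) * denominator > 0

# Smallest integer n >= 3 satisfying the inequality.
threshold = next(n for n in range(3, 10**6) if inequality_holds(n))
```

The value of `threshold` is the point from which the inductive step applies, and the computational base case has to cover every smaller n.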
Observation [1]. MAX-LEVEL-0 and MAX-LEVEL-0-LABELLING are both APX-complete.

Proof. By Theorem 1 we may label any tree topology to make it consistent with at least 1/3 of the given
triplets. Therefore, both problems are in the class APX. To prove APX-hardness we use a reduction
proposed by Wu [24] and we show that it is actually an L-reduction from the MAXIMUM SUBDAG
problem. Both L-reductions and the MAXIMUM SUBDAG problem were studied by Papadimitriou
and Yannakakis [22]. MAXIMUM SUBDAG, which is equivalent to the problem LINEAR
ORDERING, has been proven APX-complete [21] [22].

In the MAXIMUM SUBDAG problem we are given a directed graph $G = (V, A)$, and the goal
is to find a maximum-cardinality subset of arcs $A' \subseteq A$ such that $G' = (V, A')$ is acyclic.
In the reduction of Wu one constructs an instance of the MAX-LEVEL-0 problem as follows.
Given a directed graph $G = (V, A)$, let $x \notin V$ and consider the set of triplets T containing a single
triplet $t_{uv} = ux|v$ for every arc $(u,v) \in A$, so that the set of species is $X = X(T) = V \cup \{x\}$. To argue that it is an
L-reduction it remains to prove the following two claims.

1) If there exists a subset of arcs $A' \subseteq A$ such that $G' = (V, A')$ is acyclic and $|A'| = k$, then
there exists a phylogenetic tree consistent with at least k triplets from T. To prove this claim, we
consider a topological sorting of the vertices of $G'$. We construct the phylogenetic tree to be a
caterpillar whose leaves are labelled (bottom-up) by the vertices in this sorted order, with one extra lowest leaf labelled
by x. It remains to observe that for any arc $(u, v) \in A'$ the corresponding triplet $t_{uv}$ is consistent
with the obtained phylogenetic tree.

2) Given a phylogenetic tree B consistent with $l$ triplets from T, we may construct in polynomial
time a subset of arcs $A' \subseteq A$ such that $G' = (V, A')$ is acyclic and $|A'| = l$. In fact we will show that
it suffices to take A' consisting of the arcs (u, v) such that the corresponding triplet $t_{uv}$ is consistent
with B. We only need to argue that for such a choice of A' the resulting graph $G' = (V, A')$ is acyclic.
Consider the path in the tree B from the root node to the leaf labelled by the special species x.
For any vertex $v \in V$, the species v has an internal node on this path where it branched off from
the lineage of x, namely the lowest common ancestor of v and x, $LCA(v,x)$. Observe that
the position of $LCA(v,x)$ on this path induces a partial ordering $>_B$ on V. Recall that if a triplet $t_{uv} = ux|v$
is consistent with B, then $LCA(u,x)$ is a proper descendant of $LCA(v,x)$. Therefore, the consistent
triplets from T induce another partial ordering that may be extended to $>_B$. This implies that for
A' containing the arcs (u, v) such that the triplet $t_{uv}$ is consistent with B the graph $G' = (V, A')$ is
acyclic.
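Both claims can be exercised on a toy instance. The sketch below is illustrative only: trees are represented by a hypothetical `parent` dictionary, internal caterpillar nodes are named `("node", i)`, and the standard-library `graphlib` module (Python 3.9+) is used for topological sorting and cycle detection. A triplet ux|v is treated as consistent exactly when LCA(u, x) lies strictly deeper than LCA(v, x) on the root-to-x path.

```python
from graphlib import CycleError, TopologicalSorter

def caterpillar(order):
    """Parent map of the caterpillar whose leaves, read bottom-up, are `order`."""
    parent = {order[0]: ("node", 1)}
    for i in range(1, len(order)):
        parent[order[i]] = ("node", i)
        parent[("node", i)] = ("node", i + 1) if i + 1 < len(order) else None
    return parent

def branch_depth(parent, v, x):
    """Depth (edges from the root) of LCA(v, x) in the tree given by `parent`."""
    def path_to_root(w):
        path = [w]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path
    x_path = set(path_to_root(x))
    node = v
    while node not in x_path:
        node = parent[node]
    return len(path_to_root(node)) - 1

def consistent_arcs(parent, arcs, x):
    """Arcs (u, v) whose triplet ux|v is consistent with the tree (claim 2)."""
    return [(u, v) for (u, v) in arcs
            if branch_depth(parent, u, x) > branch_depth(parent, v, x)]

def topological_order(vertices, arcs):
    """A topological order of (vertices, arcs); raises CycleError on a cycle."""
    preds = {v: set() for v in vertices}
    for u, v in arcs:
        preds[v].add(u)
    return list(TopologicalSorter(preds).static_order())

def is_acyclic(vertices, arcs):
    try:
        topological_order(vertices, arcs)
        return True
    except CycleError:
        return False

# Toy instance: a directed 3-cycle; any two of its arcs form an acyclic subgraph.
vertices = ["a", "b", "c"]
arcs = [("a", "b"), ("b", "c"), ("c", "a")]
acyclic_subset = arcs[:2]

# Claim 1: build the caterpillar from a topological order, with x as lowest leaf.
order = ["x"] + topological_order(vertices, acyclic_subset)
tree = caterpillar(order)

# Claim 2: the arcs whose triplets the tree satisfies form an acyclic subgraph.
kept = consistent_arcs(tree, arcs, "x")
```

Here `kept` recovers the two arcs of the acyclic subset, and `is_acyclic(vertices, kept)` holds while the full arc set contains a cycle.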

With the above construction we have shown that the existence of an $\epsilon$-approximation algorithm
for the MAX-LEVEL-0 problem implies the existence of an $\epsilon$-approximation algorithm for the
MAXIMUM SUBDAG problem. In particular, the existence of a Polynomial-Time Approximation Scheme
(PTAS) for MAX-LEVEL-0 would imply the existence of a PTAS for MAXIMUM SUBDAG, which is
unlikely due to the results in [21] and [22].

In the reduction we could have fixed a particular topology for the tree, namely
a caterpillar. Therefore, the problem MAX-LEVEL-0-LABELLING
is also APX-hard. □



